Troll feeding: PDL vs Matlab and the Standard Hammers

david m on 2009-11-04T07:47:39

One of the members of the pdl-users mailing list asked for some help for an upcoming talk. He'd like to compare PDL and Matlab. I used to call myself a C++ programmer, and I've written a number of analysis scripts in Matlab before getting fed up with all of it and switching to Perl/PDL. So, well, here goes...

The single greatest advantage that Matlab/PDL offers over the standard hammers - C, C++, FORTRAN, and Java - is that Matlab and PDL scripts require much, much less overhead to get them to actually do something useful. I once wrote a 2000 line C++ simulation program, of which only about 150 lines were actually interesting to me as a scientist. Also, Perl scripts tend to be less of a 'my program can do everything!' solution, because perl scripts are so cheap to write. (Most perl programmers put their effort into writing modules that do everything for them, and then put it on CPAN, which makes their code far more reusable. But I digress.)

The single greatest advantage that PDL offers over Matlab is that PDL is an extension of Perl, a vastly supperior language for systems programming, file handling, data aggregation, etc. This means you'll write even less infrastructure code with PDL than you would with Matlab and it'll be easier to comprehend. In my opinion, Perl is also a more expressive than Matlab.

First a little bit of overview for numerical computing. Matlab and PDL both try to solve similar problems. As we all know, you could do all your numerical work with C, C++, FORTRAN, or Java code and - assuming you had great patience and skill - it would run really fast. However, you pay for your fast code with increased developer hours. Scripting languages like Bash, Python, and Perl take the opposite approach, assuming that you would do better to reduce your developer hours but at the expense of computational speed. We all know this is why we like Perl, because it allows us to get to solving our problems quickly. This is trade-off often favors scripting languages, but numerical programming is an important exception. Matlab and IDL, as well as Numpy and PDL, are attempts to find a happy medium between rapid development and rapid execution.

The happy medium for all of these languages lies in optimizing vectorized operations. In other words, if you have two data sets of 100 elements a piece that you want to sum, in C you would write a for loop to iterate over all the elements, but in a numerical language you would simply write $a + $b and it would give a vector/array containing elements with the sum. That sum, and many other operations, is computed using compiled C-code that iterates over the elements of $a and $b just as fast as hand-rolled C-code. The reason that PDL, Numpy, Matlab, and IDL have had such great success is because most numerical operations can actually be implemented in this vectorized way of thinking.

Note that C and FORTRAN programmers often think in terms of for loops. If you ever write a for-loop in PDL or Matlab, you're probably not doing it right, unless you're iterating through a list of files. (If you only need to act on some of your data, take a slice - using PDL::NiceSlice if you're in PDL. If you absolutely must iterate, look into MEX for Matlab and PDL::PP for PDL.)

So, using a vectorized approach makes your PDL/Matlab code almost as fast as C or FORTRAN, but it has a huge, rarely mentioned advantage: writing your code using vectorized expressions is actually more expressive and easier to understand. With less for loops hogging up screen space, you've got less extraneous code, less off-by-one errors, and can see more of your logic in one screen.

At this point, those who do their numerics in Java and C++ will get restless and note that they could achieve similar levels of expresiveness with vector/array objects and operator overloading. They are correct, and the rest of their code would run faster than the Matlab or PDL code. But then they'd be doing all of their infrastructure code in C++ or Java, which is just a headache for me and most Perl programmers. And, they probably can't do data slicing like PDL or Matlab.

At this point, even if you still plan on using your standard hammer, hopefully you understand why I prefer to work with PDL.

Now onto explaining the differences between PDL and Matlab. Here's a list of what Matlab does better than PDL:

The most jarring difference for a simple Matlab (or IDL) user will be the lack of an IDE. PDL offers a partial replacement, however, with the PDL prompt. The PDL prompt is an interactive shell, and it's possible to set up an auto-load directory in much the same way that you could set up autoloading for Matlab or IDL. Perhaps, someday PDL could have an extension for Padre that would actually make it feel more like a full IDE, but until that time comes, PDL developers will work with an open perldl or bash shell, editing scripts in their preferred editors.

PDL has 2D and 3D graphics for visualizing data, but they're not nearly as developed as in Matlab. So the simple answer to this is, "It can be done, but it may not be as easy." From what I understand, PGPLOT is a standard plotting tool for Astronomers, so it can't be all that bad. Still, Matlab makes visualization pretty darn easy and is (rightly) a major selling point.

From the numeric standpoint, Matlab has many more features than PDL. For example, there is a standard wavelet analysis toolbox in Matlab, and as far as I know, nobody has written a wavelet analysis tool for PDL. The vast wealth of toolboxes for Matlab is another major selling point for Matlab. Again it could be done in PDL, but you may have to do it yourself.

Now, here's a list of what PDL does better than Matlab:

Basically put, PDL has an incredible base upon which to build a numerical suite. As far as I am aware, PDL's data-flow and "pdl-threading" capabilities are unmatched anywhere, including Python and C++. They are also very poorly explained, which is why they are so under-appreciated. To perpetuate the problem, I will not attempt to explain them here because I don't know them well enough myself to explain them to a beginner. But there are other advantages.

Here's why I left Matlab: file handling. Handling nontrivial filenames in Matlab is a huge pain. Yes, they do trivial, trivial file globbing and they offer regular expressions, but I only thought to look for these after I switched to Perl. It didn't even occurr to me as a possibility while I was a regular Matlab user. If you need to process a server log and extract lots of data from it, you wouldn't even think to write the extraction portion in Matlab. This is Perl's bread and butter, and if you use PDL you can include the numerical analysis in the same script.

The difference in cost between Matlab and PDL is pretty big, too. This is a major sellng point for PDL, so to speak. I'm told Matlab's price can impede collaboration, though I've never actually seen it happen. Still it's nice to have free (beer) software.

The reason those cool features that I'll point at but won't get into - data-flow and "pdl-threading" - the reason those are possible is PDL::PP. This is a very advanced feature of PDL, but it is a major difference between PDL and Matlab. In Matlab, you can write C functions and compile them into MEX files, which is nice. You can do the same for plane-old Perl, too, using a variation of C called XS. What's more, you can do this inline with your code using the Inline module. It's all very nice. But with PDL, it's even better.

I mentioned that with PDL and Matlab, you try to replace for loops with vector operations. In PDL::PP, you specify what those vector operrations should do, using loops and all, with C-like code (it eventually gets turned into C code and gets compiled), but PDL::PP HANDLES HIGHER DIMENSIONS FOR YOU, as well as multiple data types. It is far easier to write custom-rolled C-code for PDL than for Matlab. The only problem is that the documentation for PDL::PP is difficult to get through, assumes you're a PDL expert, and you're willing to give the documentation a number of reads in order to get it. (By the way, the documentation that Toumas and company did write has a nice style and covers a lot of material, but let's just say it's not quite like reading the Camel book.)

Finally, PDL has a great group of developers. They're just nice people. I'm sure you could find the same thing with Matlab, but I never even thought to look. Perl is just so much more of a culture-oriented language.

Recap: what does PDL have the Matlab doesn't? It's based in Perl, it's got all of CPAN at its disposal, it's got the amazing PDL::PP at its core, its free, and it's got cool developers. Where does Matlab win out? Great visualization, huge user base, clean IDE, and a large tool set.

PDL is great

tsee on 2009-11-04T13:50:29

My only exposure to Matlab has been writing a replacement for Matlab and IDL tools in Perl, so I can't and won't comment on it.

But PDL is great. I once wrote a 20 line PDL-based tool that could do the same task (with the same performance because the PDL devs are better C programmers than I am) as a 1k line C/C++ app. Note that here, I went the other way: I used PDL for prototyping and then rewrote in C because I needed to *efficiently* pass data from PDL to SDL. That was hard from Perl. I do have to admit that ten of the 20 lines are probably the most dense code I ever produced. There is more brain in there than in a convoluted regular expression.

What I really want to stress is: PDL is awesome and much under-appreciated. Thank you for writing about your experience.